I. Basis of the report

1. With regard to the elements of the international application (Replacement sheets which have been furnished to
the receiving Office in response to an invitation under Article 14 are referred to in this report as "originally filed"
and are not annexed to this report since they do not contain amendments (Rules 70.16 and 70.17)):

Description, Pages 

1-22 as originally filed

Claims, Numbers 

1-14 received on 05.08.2004 with letter of 05.08.2004 

Drawings, Sheets 

1/7-7/7 as originally filed 

2. With regard to the language, all the elements marked above were available or furnished to this Authority in the 
language in which the international application was filed, unless otherwise indicated under this item. 

These elements were available or furnished to this Authority in the following language: , which is: 

□ the language of a translation furnished for the purposes of the international search (under Rule 23.1(b)).

□ the language of publication of the international application (under Rule 48.3(b)). 

□ the language of a translation furnished for the purposes of international preliminary examination (under 
Rule 55.2 and/or 55.3). 

3. With regard to any nucleotide and/or amino acid sequence disclosed in the international application, the 
international preliminary examination was carried out on the basis of the sequence listing: 

□ contained in the international application in written form. 

□ filed together with the international application in computer readable form. 

□ furnished subsequently to this Authority in written form. 

□ furnished subsequently to this Authority in computer readable form. 

□ The statement that the subsequently furnished written sequence listing does not go beyond the disclosure 
in the international application as filed has been furnished. 

□ The statement that the information recorded in computer readable form is identical to the written sequence 
listing has been furnished. 

4. The amendments have resulted in the cancellation of: 

□ the description, pages: 

□ the claims, Nos.: 

□ the drawings, sheets: 
PCT/EP2004/000104



5. □ This report has been established as if (some of) the amendments had not been made, since they have
been considered to go beyond the disclosure as filed (Rule 70.2(c)).

(Any replacement sheet containing such amendments must be referred to under item 1 and annexed to this 
report.) 

6. Additional observations, if necessary: 

V. Reasoned statement under Article 35(2) with regard to novelty, inventive step or industrial applicability; 
citations and explanations supporting such statement 

1. Statement 

Novelty (N) Yes: Claims 1-14 

No: Claims 

Inventive step (IS) Yes: Claims 1-3,5-7,13-14 

No: Claims 4,8-12 

Industrial applicability (IA) Yes: Claims 1-14 

No: Claims 

2. Citations and explanations 
see separate sheet 
Re Item V

1. Reference is made to the following documents:

D1: WO 02/29784 A (CLARITY LLC; ERTEN GAMZE (US)) 11 April 2002 (2002-04-11)

D2: WO 02/084644 A (DEUTSCHE TELECOM AG) 24 October 2002 (2002-10-24)



2. The document D1 is regarded as being the closest prior art to the subject-matter of 
claim 1, and shows (abstract, figs. 11, 13-15) an audio-visual speech processing
system in which the audio and video signals are analysed in parallel, and information 
from the video signal is used for speech detection and the selection of filters for noise 
removal in the audio signal. 

The subject-matter of claim 1 differs from this known system in that it also provides a
multi-channel acoustic echo cancellation unit specially adapted to perform near-end
speaker detection and double-talk detection.

The subject-matter of claim 1 is therefore new (Article 33(2) PCT). 

The problem to be solved by the present invention may be regarded as how to
perform near-end speaker detection and double-talk detection in an audio-visual
interface.

The solution to this problem proposed in claim 1 of the present application is
considered as involving an inventive step (Article 33(3) PCT) because it is not
obvious that the skilled person would use both audio and visual features to perform
these detections. The skilled person would more likely use audio features only.



Claims 2 and 3 are dependent on claim 1 and as such also meet the requirements of the
PCT with respect to novelty and inventive step. Claims 13 and 14 are claims
corresponding to the use of the system of claim 1 and as such also meet the
requirements of the PCT with respect to novelty and inventive step.
3. The present application does not meet the criteria of Article 33(1) PCT, because the
subject-matter of claim 4 does not involve an inventive step in the sense of Article
33(3) PCT.

3.1 The same document D1 is regarded as being the closest prior art to the
subject-matter of claim 4 (see the disclosure and passages cited above).

The subject-matter of independent claim 4 differs from the disclosure of D1 in that the 
noise reduction algorithm is a spectral subtraction method using the noise signal 
estimated during speech pauses while D1 uses filters dependent on the recognized 
visemes. This difference is only a simplification of the noise reduction system, coming 
back to a standard known method. 

D2 (abstract) discloses this known noise estimation and spectral subtraction solution. 
The features disclosed in D1 and D2 would therefore be combined by the skilled 
person without exercise of any inventive skills in order to solve the corresponding 
problem. The proposed solution in independent claim 4 thus cannot be considered 
inventive (Article 33(3) PCT). 

3.2 Dependent claims 8-12 do not contain any features which, in combination with the
features of any claim to which they refer, meet the requirements of the PCT in
respect of inventive step; see documents D1 and D2 and the corresponding passages cited
in the search report.

3.3 The combination of the features of dependent claims 5-7 is neither known from, nor
rendered obvious by, the available prior art. However, this combination results in the
same subject-matter as defined by independent claim 1.

Re Item VII. 

1. Contrary to the requirements of Rule 5.1(a)(ii) PCT, the relevant background art
disclosed in the documents D1-5 is not mentioned in the description, nor are these 
documents identified therein. 
Re Item VIII.

1. Claim 1 does not meet the requirements of Article 6 PCT in that the matter for which
protection is sought is not clearly defined. The following functional statements do not
enable the skilled person to determine which technical features are necessary to
perform the stated function: "perform a near-end speaker detection and double-talk
detection algorithm based on ..."

The claim attempts to define the subject-matter in terms of the result to be achieved, 
which merely amounts to a statement of the underlying problem, without providing the 
technical features necessary for achieving this result. 



Form PCT/Separate Sheet/409 (Sheet 3) (EPO-April 1997) 




10/542869 


PCT/EP2004/000104 

SONY ERICSSON MOBILE COMMUNICATIONS AB 
P27696WO 


NEW CLAIMS 

1. A noise reduction system with an audio-visual user interface, said system being specially
adapted for running an application for combining visual features ($o_{v,nT}$) extracted from a
digital video sequence ($v(nT)$) showing the face of a speaker ($S_i$) with audio features ($o_{a,nT}$)
extracted from an analog audio sequence ($s(t)$), wherein said audio sequence ($s(t)$) can
include noise in the environment of said speaker ($S_i$), said noise reduction system (200b/c)
comprising

- means (101a, 106b) for detecting and analyzing said analog audio sequence ($s(t)$),
- means (101b') for detecting said video sequence ($v(nT)$), and
- means (104a+b, 104'+104") for analyzing the detected video signal ($v(nT)$),

wherein a noise reduction circuit (106) of said noise reduction system is adapted to separate
the speaker's voice from said background noise ($n(t)$) based on a combination of derived
speech characteristics ($o_{av,nT} := [o_{a,nT}^T, o_{v,nT}^T]^T$) and outputting a speech activity indication
signal ($\hat{s}_i(nT)$) which is obtained by a combination of speech activity estimates supplied
by said analyzing means (106b, 104a+b, 104'+104"),
characterized by

a multi-channel acoustic echo cancellation unit (108) being specially adapted to perform a
near-end speaker detection and double-talk detection algorithm based on acoustic-phonetic
speech characteristics derived by said audio feature extraction and analyzing means (106b)
and said visual feature extraction and analyzing means (104a+b, 104'+104").

2. A noise reduction system according to claim 1,
characterized by

means (SW) for switching off an audio channel in case the actual level of said speech
activity indication signal ($\hat{s}_i(nT)$) falls below a predefined threshold value.
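
The switching behaviour of claim 2 can be illustrated with a minimal sketch that mutes every audio frame whose speech activity value falls below a threshold. The function name `gate_audio`, the frame-based representation and the threshold used in the example are assumptions made for illustration only; the claim itself only requires means (SW) for switching off an audio channel.

```python
from typing import List, Sequence


def gate_audio(frames: Sequence[Sequence[float]],
               activity: Sequence[float],
               threshold: float) -> List[List[float]]:
    """Return the frames, with frames below the activity threshold zeroed.

    frames    -- digitized audio, one list of samples per frame (assumed layout)
    activity  -- one speech activity value per frame
    threshold -- predefined threshold value; frames below it are switched off
    """
    if len(frames) != len(activity):
        raise ValueError("need exactly one activity value per frame")
    gated = []
    for frame, level in zip(frames, activity):
        if level < threshold:
            # Speech activity too low: switch the channel off for this frame.
            gated.append([0.0] * len(frame))
        else:
            gated.append(list(frame))
    return gated
```

For example, `gate_audio([[1.0, 2.0], [3.0, 4.0]], [0.9, 0.1], 0.5)` keeps the first frame and silences the second.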
3. A noise reduction system according to any one of claims 1 or 2,
characterized in that

said audio feature extraction and analyzing means (106b) is an amplitude detector.

4. A near-end speaker detection method reducing the noise level of a detected analog audio
sequence ($s(t)$),

said method being characterized by the following steps:

- subjecting (S1) said analog audio sequence ($s(t)$) to an analog-to-digital conversion,

- calculating (S2) the corresponding discrete signal spectrum ($S(k \cdot \Delta f)$) of the
analog-to-digital-converted audio sequence ($s(nT)$) by performing a Fast Fourier Transform (FFT),

- detecting (S3) the voice of said speaker ($S_i$) from said signal spectrum ($S(k \cdot \Delta f)$) by
analyzing visual features ($o_{v,nT}$) extracted from a video sequence ($v(nT)$) recorded
simultaneously with the analog audio sequence ($s(t)$), tracking the current location of the
speaker's face, lip movements and/or facial expressions of the speaker ($S_i$) in
subsequent images,

- estimating (S4) the noise power density spectrum ($\Phi_{nn}(f)$) of the statistically
distributed background noise ($n(t)$) based on the result of the speaker detection step (S3),

- subtracting (S5) a discretized version ($\Phi_{nn}(k \cdot \Delta f)$) of the estimated noise power
density spectrum ($\Phi_{nn}(f)$) from the discrete signal spectrum ($S(k \cdot \Delta f)$) of the
analog-to-digital-converted audio sequence ($s(nT)$), and

- calculating (S6) the corresponding discrete time-domain signal ($\hat{s}_i(nT)$) of the obtained
difference signal by performing an Inverse Fast Fourier Transform (IFFT), thereby
yielding a discrete version of the recognized speech signal.
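
Steps S2 to S6 of claim 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the applicant's implementation: a plain discrete Fourier transform written with the standard library stands in for the FFT (same result, slower), the per-frame boolean `speech_flags` stand in for the outcome of the visual speaker detection step (S3), the noise power density spectrum is estimated by averaging the power spectra of the frames flagged as speech pauses, and the subtraction is carried out in the power domain with clamping at zero while the phase of the noisy spectrum is reused. All function and parameter names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (stand-in for the FFT of step S2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Naive inverse transform (stand-in for the IFFT of step S6)."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(frames: Sequence[Sequence[float]],
                         speech_flags: Sequence[bool]) -> List[List[float]]:
    """Subtract a noise estimate taken from speech pauses from every frame."""
    spectra = [dft(frame) for frame in frames]                      # S2
    # S3 (assumed input): speech_flags[i] is True where the speaker talks.
    pauses = [s for s, talking in zip(spectra, speech_flags) if not talking]
    if not pauses:
        raise ValueError("at least one speech pause is needed for the noise estimate")
    n_bins = len(spectra[0])
    # S4: average power per frequency bin over all pause frames.
    noise_psd = [sum(abs(s[k]) ** 2 for s in pauses) / len(pauses)
                 for k in range(n_bins)]
    cleaned = []
    for s in spectra:
        difference = []
        for k in range(n_bins):
            # S5: power-domain subtraction, clamped so power never goes negative.
            power = max(abs(s[k]) ** 2 - noise_psd[k], 0.0)
            difference.append(cmath.rect(math.sqrt(power), cmath.phase(s[k])))
        cleaned.append(idft(difference))                            # S6
    return cleaned
```

Clamping at zero is one common way to avoid negative power values after subtraction; other variants (for example a spectral floor) are equally compatible with the wording of the claim.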

5. A near-end speaker detection method according to claim 4,
characterized by the step of

conducting (S7) a multi-channel acoustic echo cancellation algorithm which models echo
path impulse responses by means of adaptive finite impulse response (FIR) filters and
subtracts echo signals from the analog audio sequence ($s(t)$) based on acoustic-phonetic speech
characteristics derived by an algorithm for extracting visual features ($o_{v,nT}$) from a video
[...] corresponding to the signal ($\hat{s}_i(t)$) which represents said speaker's voice as well as an estimate
($\tilde{\Phi}_{nn}(f)$) for the noise power density spectrum of the statistically distributed
background noise ($n(t)$).

10. A near-end speaker detection method according to claim 9,
characterized by the step of

correlating (S9) the discrete signal spectrum ($S_\tau(k \cdot \Delta f)$) of a delayed version ($s(nT-\tau)$) of the
analog-to-digital-converted audio signal ($s(nT)$) with a visual speech activity estimate taken
from a visual feature vector ($o_{v,t}$) supplied by the visual feature extraction and analyzing
means (104a+b, 104'+104"), thereby yielding a further estimate ($\tilde{S}_i'(f)$) for updating the
estimate ($\tilde{S}_i(f)$) for the frequency spectrum ($S_i(f)$) corresponding to the signal ($\hat{s}_i(t)$)
which represents said speaker's voice as well as a further estimate ($\tilde{\Phi}_{nn}'(f)$) for updating
the estimate ($\tilde{\Phi}_{nn}(f)$) for the noise power density spectrum of the statistically
distributed background noise ($n(t)$).

11. A near-end speaker detection method according to any one of claims 9 or 10,
characterized by the step of

adjusting (S10) the cut-off frequencies of a band-pass filter (204) used for filtering the
discrete signal spectrum ($S(k \cdot \Delta f)$) of the analog-to-digital-converted audio signal ($s(t)$)
dependent on the bandwidth of the estimated speech signal spectrum ($\tilde{S}_i(f)$).

12. A near-end speaker detection method according to any one of claims 4 to 8,
characterized by the steps of

- adding (S11a) an audio speech activity estimate obtained by an amplitude detection of
the band-pass-filtered discrete signal spectrum ($S(k \cdot \Delta f)$) of the analog-to-digital-
converted audio signal ($s(t)$) to a visual speech activity estimate taken from a visual
feature vector ($o_{v,t}$) supplied by said visual feature extraction and analyzing means
(104a+b, 104'+104"), thereby yielding an audio-visual speech activity estimate,

- correlating (S11b) the discrete signal spectrum ($S(k \cdot \Delta f)$) with the audio-visual speech
activity estimate, thereby yielding an estimate ($\tilde{S}_i(f)$) for the frequency spectrum ($S_i(f)$)
corresponding to the signal which represents said speaker's voice as well as an
estimate ($\tilde{\Phi}_{nn}(f)$) for the noise power density spectrum ($\Phi_{nn}(f)$) of the statistically
distributed background noise ($n(t)$), and

- adjusting (S11c) the cut-off frequencies of a band-pass filter (204) used for filtering the
discrete signal spectrum ($S(k \cdot \Delta f)$) of the analog-to-digital-converted audio signal ($s(t)$)
dependent on the bandwidth of the estimated speech signal spectrum ($\tilde{S}_i(f)$).
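
Steps S11a and S11c of claim 12 can be illustrated with a short sketch. It assumes that both speech activity estimates are given as one number per frame and that the bandwidth of the estimated speech spectrum is taken as the range of frequency bins whose magnitude reaches a fraction of the peak; that bandwidth criterion, the default `rel_threshold` and both function names are assumptions made for illustration, since the claim does not specify them. The correlation step (S11b) is not shown.

```python
from typing import List, Optional, Sequence, Tuple


def audio_visual_activity(audio_est: Sequence[float],
                          visual_est: Sequence[float]) -> List[float]:
    """S11a: add the audio and the visual speech activity estimate per frame."""
    if len(audio_est) != len(visual_est):
        raise ValueError("both estimates must cover the same frames")
    return [a + v for a, v in zip(audio_est, visual_est)]


def band_edges(speech_magnitude: Sequence[float],
               delta_f: float,
               rel_threshold: float = 0.1) -> Optional[Tuple[float, float]]:
    """S11c: derive band-pass cut-off frequencies from the speech bandwidth.

    speech_magnitude -- magnitude of the estimated speech spectrum per bin k
    delta_f          -- frequency spacing between adjacent bins
    rel_threshold    -- assumed criterion: bins at or above this fraction of
                        the peak magnitude count as part of the speech band
    """
    peak = max(speech_magnitude, default=0.0)
    if peak <= 0.0:
        return None  # no speech energy, so no bandwidth can be derived
    active = [k for k, m in enumerate(speech_magnitude)
              if m >= rel_threshold * peak]
    return active[0] * delta_f, active[-1] * delta_f
```

With `delta_f = 100.0` and magnitudes `[0.0, 0.05, 1.0, 0.5, 0.02]`, the sketch places the lower and upper cut-off frequencies at bins 2 and 3.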

13. Use of a noise reduction system (200b/c) according to any one of claims 1 to 3 and a
near-end speaker detection method according to any one of claims 5 to 13 for a
video-telephony based application in a telecommunication system running on a video-enabled
phone with a built-in video camera (101b') pointing at the face of a speaker ($S_i$)
participating in a video telephony session.

14. A telecommunication device equipped with an audio-visual user interface,
characterized by

a noise reduction system (200b/c) according to any one of claims 1 to 3.